Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
Please be advised that the material in this case study describes and discusses rates of suicide, as well as rates and symptoms of depression.
According to the National Institute of Mental Health (NIMH):
If you are in crisis and need help, call this toll-free number for the National Suicide Prevention Lifeline (NSPL), available 24 hours a day, every day: 1-800-273-TALK (8255). The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at www.suicidepreventionlifeline.org.
The Crisis Text Line is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to 741741 and a trained crisis counselor will respond to you with support and information over text message. Visit www.crisistextline.org.
Also see here for more information about how to recognize and help youths experiencing symptoms of depression.
Motivation
Rates of depression appear to have been increasing among American youths since around 2010 according to a recent report. A recent study also shows that youths appear to be seeking more care from mental health services.
This case study will explore how rates of major depressive episodes have changed since the early 2000s and across different youth subgroups (age, gender, ethnicity). We also will explore how rates of treatment for depression of youths have changed over time.
Photo by K. Mitch Hodge on Unsplash
The major symptoms of a major depressive episode include:
Sleep disorder (increased or decreased)
Interest deficit (anhedonia)
Guilt (worthlessness, hopelessness, regret)
Energy deficit
Concentration deficit
Appetite disorder (increased or decreased)
Psychomotor retardation or agitation
Suicidality

Click here to see the diagnostic requirements for a major depressive episode (MDE) according to the DSM-5.
A. Five or more of the following symptoms have been present and documented during the same two-week period and represent a change from previous functioning; at least one of the symptoms is either (1) depressed mood or (2) loss of interest or pleasure.
Note: Do not include symptoms that are clearly attributable to another medical condition.
Depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g., feels sad, empty, hopeless) or observation made by others (e.g., appears tearful)
Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day (as indicated by either subjective account or observation)
Significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body weight in a month), or decrease or increase in appetite nearly every day
Insomnia or hypersomnia nearly every day
Psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)
Fatigue or loss of energy nearly every day
Feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly every day (not merely self-reproach or guilt about being sick)
Diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)
Recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide
B. The symptoms do not meet criteria for a mixed episode.
C. The episode is not attributable to the physiological effects of a substance or to another medical condition.
Note: Criteria A-C represent a major depressive episode.
Note: Responses to a significant loss (e.g., bereavement, financial ruin, losses from a natural disaster, a serious medical illness or disability) may include feelings of intense sadness, rumination about the loss, insomnia, poor appetite and weight loss noted in Criterion A, which may resemble a depressive episode. Although such symptoms may be understandable or considered appropriate to the loss, the presence of a major depressive episode in addition to the normal response to a significant loss should also be carefully considered. This decision inevitably requires the exercise of clinical judgment based on the individual’s history of and the cultural norms for the expression of distress in the context of loss.
D. The occurrence of the major depressive episode is not better explained by schizoaffective disorder, schizophrenia, schizophreniform disorder, delusional disorder, or other specified and unspecified schizophrenia spectrum and other psychotic disorders.
E. There has never been a manic episode or a hypomanic episode.
Note: This exclusion does not apply if all of the manic-like or hypomanic-like episodes are substance-induced or are attributable to the physiological effects of another medical condition.
This case study is motivated by the following two articles:
Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. J Abnorm Psychol. 128, 3 (2019):185-199. doi:10.1037/abn0000410
Olfson M, Blanco C, Wang S, Laje G, Correll CU. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. JAMA Psychiatry. 71, 1 (2014):81-90. doi:10.1001/jamapsychiatry.2013.3074
The main findings of the first article are:
Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.
Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.
Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.
The main findings of the second article are:
Compared with adult mental health care, the mental health care of young people has increased more rapidly…
This means that the number of youths receiving mental health care has increased faster than the number of adults receiving mental health care.
Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses … increased significantly faster for youths (from 7.78 to 15.30 visits) than for adults (from 23.23 to 28.48 visits) (interaction: P < .001).
Psychiatrist visits also increased significantly faster for youths (from 2.86 to 5.71 visits).
Summary: While depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.
In this case study, we will be using data from the National Survey on Drug Use and Health (NSDUH) related to treatment and major depressive episode (MDE) rate to explore how trends in mental health have changed over time and how different groups compare.
These data were also used in the first referenced article.
Main Questions
Our main questions:
- How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
- Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?
Learning Objectives
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Scrape data directly from a website (rvest)
- Subset and filter data (dplyr)
- Write functions to wrangle data repetitively
- Work with character strings (stringr)
- Reshape data into different formats (tidyr)
- Data visualizations (ggplot2) with labels (directlabels) and facets for different groups
- Combine multiple plots (cowplot)
- Optional: Create an animated gif (magick)
Statistical Learning Objectives:
- Discuss the impact of self-reporting bias on survey responses
- Define and create a contingency table
- Implementation of a chi-squared test for independence
- Interpretation of a chi-squared test for independence
In this case study, we will especially focus on using packages and functions from the Tidyverse, such as rvest. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R more legible and intuitive.

We will begin by loading the packages that we will need:
library(here)
library(rvest)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)
library(directlabels)
library(scales)
library(forcats)
library(ggthemes)
library(cowplot)
Packages used in this case study:
| here | to easily load and save data |
| rvest | to scrape web pages |
| dplyr | to subset and filter the data for specific groups, to replace specific values with NA, rename variables, and perform functions on multiple variables |
| magrittr | to use and reassign data objects using the %<>% pipe operator |
| stringr | to manipulate strings |
| tidyr | to change the shape or format of tibbles to wide and long |
| tibble | to create tibbles and convert values of a column to row names |
| purrr | to apply a function to each column of a tibble or each tibble in a list |
| ggplot2 | to create plots |
| directlabels | to add labels directly to lines in plots |
| scales | to get the current linetype options |
| forcats | to reorder factor for plot |
| ggthemes | to create a plot to see what the different linetypes look like |
| cowplot | to combine plots together |
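As a small illustration of the %<>% operator mentioned in the table, here is a minimal sketch. The vector toy_values is hypothetical and is not part of the case study data. The operator pipes an object into a function and assigns the result back to the same name.

```r
library(magrittr)

# toy_values is a hypothetical vector used only for illustration
toy_values <- c(1.4, 2.6)

# Equivalent to: toy_values <- toy_values %>% round()
toy_values %<>% round()

toy_values
```

This saves retyping the object name when a data object is updated in place, at the cost of making the reassignment a little less visible.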
The first time we use a function, we will use the :: operator to indicate which package it comes from. This is only strictly necessary when two loaded packages have functions with the same name, but we will include it here to be informative about where the functions we use come from.
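To illustrate the package::function notation, here is a minimal sketch using only base R (the object names are made up for this example). Both calls run the same function. The prefix matters when names overlap, for example when dplyr is loaded and its filter() masks stats::filter().

```r
# values is a made-up vector used only for illustration
values <- c(1, 3, 5)

# Without a prefix, R uses the first median() it finds among loaded packages
implicit_call <- median(values)

# With a prefix, we state explicitly that we want median() from stats
explicit_call <- stats::median(values)

identical(implicit_call, explicit_call)
```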
Context
To motivate the examination of the mental health of American youths, we begin by exploring the rate of suicide in the United States (US). According to the CDC the rate of suicide has increased for both genders.

While suicide does appear to be increasing among youths, it also appears to be increasing among most age groups in the US over the past decade and a half for both females and males:


According to the CDC:
Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States.

Furthermore, according to the CDC:
In 2016, suicide became the second leading cause of death among youths.
So although suicide is on the rise for most age groups, suicide is one of the top two contributors to death for youths.
Thus, this warrants further examination of the mental health of American youths.

Historically, suicide rates were much higher before 1950; however, we have seen an increase over the last 20 years.

Besides the US, other countries are also experiencing increased rates of depression in youths.
See this report from the World Health Organization (WHO) about rates of depression in other countries.
See here for an interesting discussion about what may be causing increased depression rates.
Limitations
There are some important considerations regarding this data analysis to keep in mind:
The data that we will use come from a survey and are therefore values from a sample that estimate those of the true population. In our statistical analysis we use these sample values as if they were population values (because this is all we have access to). Thus, our results are not necessarily indicative of population differences.
Furthermore, the sampling mechanism used for the survey can introduce selection bias in cases where the sampling methods do not produce a representative sample.
Data are collected from human participants; this presents the potential for information bias, as there is the potential that participants in the sampling frame may for a variety of reasons report inaccurate information.
Data about certain group intersections (for example, individuals of a particular gender and ethnicity) and about particular groups, such as specific ethnicities or gender or sexual identity groups like LGBTQIA+ (lesbian/gay/bisexual/transgender/queer and questioning) or non-binary gender populations, are unfortunately not available in the data used in this analysis or in most research about this topic.
Note: While gender and sex are not actually binary, the data used in this analysis unfortunately only contains information for groups of individuals who self-reported as male or female. We also acknowledge that unfortunately not all ethnicities or group intersections are represented in the data either. More research should be devoted to collecting data about the mental health of these groups.
What are the data?
We will be using data from the National Survey on Drug Use and Health (NSDUH) which is directed by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).
This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (ages 12 and up) are interviewed each year about health-related issues. Only civilian, non-institutionalized individuals are included. Households are randomly selected and then a professional interviewer visits the addresses and asks one or two of the residents to interview. The interviewer brings a laptop with them that the participants use to fill out the survey, which typically takes an hour to complete. If a participant chooses to participate they receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy particularly focused on drug and alcohol use as well as mental health. See here for more details about the survey.
The data are made available publicly online on the Substance Abuse & Mental Health Data Archive.

On the website with the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.

If you click on the TOC button in the upper left corner, you will be directed to another website, where a large PDF document containing all of the results can be downloaded.
We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus, the following tables are of interest to us:
| Table 11.1A | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018 |
| Table 11.1B | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018 |
| Table 11.2A | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.2B | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018 |
| Table 11.3A | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018 |
| Table 11.3B | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018 |
| Table 11.4A | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.4B | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018 |
Our goal is to bring these data into R so we can explore them.
Click here for how the NSDUH defines a major depressive episode (MDE)
According to the NSDUH 2018 report:
Respondents were defined as having had an MDE in the past 12 months if they had at least one period of 2 weeks or longer in the past year when they experienced a depressed mood or loss of interest or pleasure in daily activities, accompanied by problems with sleeping, eating, energy, concentration, or self-worth. The MDE questions are based on diagnostic criteria from DSM-5. Some of the wordings of the depression questions for adolescents aged 12 to 17 and adults aged 18 or older differed slightly to make the questions more developmentally appropriate for adolescents.
Adolescents were defined as having an MDE with severe impairment if their depression caused severe problems with their ability to do chores at home, do well at work or school, get along with their family, or have a social life.
Data Import
Data are often made available online. Sometimes, the data we are interested in are available for download from a web page as a delimited text file or an Excel file. However, sometimes data are not made available in this manner, as is the case for the NSDUH survey data.
How do we proceed in this scenario?
We could manually copy each cell of data; however, this process is often inefficient, subject to error, and not reproducible. For example, if we wanted to run the same analysis on next year's data and it happened to be formatted in the same way, we would have to repeat all of the manual copying.
Alternatively, we could use R to scrape the data from the web!
Formally, web scraping is the process of extracting data from a webpage. Let’s learn how to do this for our case study.
Basic steps of web scraping
There are two main steps to web scraping:
Identify location of data on the webpage that will be scraped.
Save the webpage element to an R object.
We accomplish STEP 1 with our web browser.
We accomplish STEP 2 in the R programming environment.
The location of the data on the webpage that will be scraped can be identified using a language called XPath, which is short for XML Path Language. It is used to identify pieces (in this case called nodes) of a document written in XML. XML, which is short for Extensible Markup Language, is frequently used for documents on the internet, similar to HTML. One of the major differences between the two is that HTML uses a fixed set of tags that mainly describe how content is displayed, while XML allows custom tags that describe the structure of the data itself. Both kinds of documents are organized as a tree of nested nodes, so XPath can be used with either one. This structural information can be used to parse documents so that we can scrape only the data that we are interested in from a website.
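To make the idea of nodes and XPath more concrete, here is a minimal sketch that uses the xml2 package (which rvest builds on) with a tiny made-up XML document. The tag names, the id values, and the object names are all hypothetical and do not come from the NSDUH website.

```r
library(xml2)

# A tiny made-up XML document with two <table> nodes
toy_doc <- xml2::read_xml(
  "<survey>
     <table id='t1'><row>a</row><row>b</row></table>
     <table id='t2'><row>c</row></table>
   </survey>"
)

# XPath: find every <row> node directly inside the <table> whose id is 't1'
toy_rows <- xml2::xml_find_all(toy_doc, "//table[@id='t1']/row")

# Extract the text stored in the matching nodes
row_text <- xml2::xml_text(toy_rows)
row_text
```

The XPath expression selects nodes by their position in the document tree and by their attributes, which is how we can target one table on a page that contains many.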
Additional resources for web scraping:
The rvest package
In this case study, we will scrape data from the tables on the NSDUH survey website.
Note that these data are available in a large PDF with all the results by year if you wish to use the data from this particular source.
One option to import the data would be to import the PDF. However, it is not easy to find this PDF, and it would be difficult and time consuming to locate our tables of interest and extract the data from them. If one wanted to do this anyway, say if the tables were not available online, they could use the pdftools package. See this other case study and this other case study for two methods to work with PDFs.
Another option could be to copy and paste the data from the website to another file that we would also need to import. But this would not be as efficient or reproducible and might result in errors.
Alternatively, we will use the rvest package to scrape the data directly from the tables on the website.
Assuming the data next year would be displayed in a similar manner, this could allow us to simply modify our code based on the url for the data next year to run the same analysis on the data easily.
However, it is important to keep in mind that one downside of scraping the data directly from the web is that the website could change. This can be a good thing if the website adds additional data and keeps the same formatting, because it would allow us to get the additional data very easily. However, if the website changes its formatting, then we would need to update our code.
Scraping tables into R
The two web scraping steps for these tables can be broken down even further:
Identify location of data that will be scraped
- right-click to inspect element (webpage)
- hover pointer over components of element (webpage) until the data has been found
- copy XPath of data sought
Save webpage element to an object in R
- import html code for the webpage
- extract pieces of HTML documents (webpage) using XPath
- parse the extracted data into a data frame
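The three R substeps listed above can be sketched as follows. This is only an illustration under stated assumptions: instead of the NSDUH website, it uses a small made-up HTML string, and the id "demo", the column names, and the values are hypothetical. For a real page, the string would be replaced by the page URL and the XPath by the one copied from the browser.

```r
library(xml2)
library(rvest)

# Made-up HTML standing in for a results page (not the NSDUH site)
toy_html <- '<html><body>
  <table id="demo">
    <tr><th>group</th><th>value</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
</body></html>'

# 1. Import the HTML code for the webpage
toy_page <- xml2::read_html(toy_html)

# 2. Extract the piece of the document identified by the XPath
toy_node <- rvest::html_node(toy_page, xpath = '//*[@id="demo"]')

# 3. Parse the extracted node into a data frame
toy_table <- rvest::html_table(toy_node)
toy_table
```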
Below is an animated overview of the process.
Click here if you want to see how this animation was made!
First the images need to be imported into R using the image_read function of the magick package.
step1 <- magick::image_read(here::here("img", "webpage_screenshot.png"))
step2 <- image_read(here::here("img", "table_screenshot_inspect.png"))
step3 <- image_read(here::here("img", "table_screenshot_inspect_table.png"))
step4 <- image_read(here::here("img", "table_screenshot_inspect_table_xpath.png"))
step5 <- image_read(here::here("img", "table_screenshot_xpath_copy_r.png"))
step5_zoom <- image_read(here::here("img", "table_screenshot_xpath_copy_r_zoom.png"))
The last image is smaller than the others. To get a sense of the size, we can use the image_info() function of the magick package.
image_info(step5)
image_info(step5_zoom)
# A tibble: 1 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG     1440    900 sRGB       TRUE    306274 72x72
# A tibble: 1 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG      869    231 sRGB       TRUE     57559 72x72
First let’s re-size the second image to make it a bit larger using the image_resize() function of the magick package. We will re-size the width to be the same as the previous image width and keep the aspect ratio for the height by using “1440x”. If we wanted to just do the same for height we would use “x900”.
step5_zoom <- image_resize(step5_zoom, "1440x")
step5_zoom

We can add a white border around the last image to make its height more similar to the others using the image_border() function of the magick package. There are many image modification functions in the magick package! See here to learn more.
step5_zoom <- image_border(step5_zoom, "white", "2x334")
step5_zoom

Looks good!
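The size arithmetic behind the "1440x" and "2x334" geometry strings can be checked by hand. The sketch below uses only base R and the dimensions reported by image_info(). It assumes that the scaled height is rounded to the nearest pixel and that the border is added on both sides of each dimension. The variable names are made up for this example.

```r
# Dimensions of the zoomed screenshot before re-sizing
orig_width  <- 869
orig_height <- 231

# "1440x" sets the width and scales the height to keep the aspect ratio
target_width  <- 1440
scaled_height <- round(orig_height * target_width / orig_width)

# "2x334" adds 2 pixels on the left and right, 334 on the top and bottom
border_x <- 2
border_y <- 334
final_width  <- target_width + 2 * border_x
final_height <- scaled_height + 2 * border_y

c(width = final_width, height = final_height)
```

Under these assumptions the result is 1444 by 1051 pixels, which is close to the 1440 by 900 size of the other screenshots in width but still taller, so a final re-size is needed before animating.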
Now we will make the sequence of images for our animation. We also want to indicate how long we want to spend on each relative to the others. We want to linger on the last image so we include it two times.
img <- c(step1,
         step2,
         step3,
         step4,
         step5,
         step5_zoom,
         step5_zoom)
Now, we are ready to create our gif! But first we want to modify our images a bit more.
First we want to make all images within img the exact same size using the image_resize() function. To force every image to the exact dimensions, we can add ! at the end of the geometry string, which ignores the aspect ratio. Here are the sizes before re-sizing:
image_info(img)
# A tibble: 7 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG     1439    855 sRGB       TRUE    189980 72x72
2 PNG     1436    857 sRGB       TRUE    232355 72x72
3 PNG     1439    857 sRGB       TRUE    315277 72x72
4 PNG     1439    856 sRGB       TRUE    346714 72x72
5 PNG     1440    900 sRGB       TRUE    306274 72x72
6 PNG     1444   1051 sRGB       TRUE         0 72x72
7 PNG     1444   1051 sRGB       TRUE         0 72x72
img <- image_resize(img, '1440x900!')
image_info(img)
# A tibble: 7 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG     1440    900 sRGB       TRUE         0 72x72
2 PNG     1440    900 sRGB       TRUE         0 72x72
3 PNG     1440    900 sRGB       TRUE         0 72x72
4 PNG     1440    900 sRGB       TRUE         0 72x72
5 PNG     1440    900 sRGB       TRUE         0 72x72
6 PNG     1440    900 sRGB       TRUE         0 72x72
7 PNG     1440    900 sRGB       TRUE         0 72x72
We also want to morph or blend each image into the next so that there appears to be a smooth transition. We can also specify how many frames to include in the morph, to speed up or slow down the blend from one image to another. We will specify that 4 frames should be used in the morph by using the image_morph() function.
To make the final animation we use the image_animate() function. Importantly, we want to pause on each frame for about 70/100 of a second to give people a chance to see what is happening, so we set the delay argument, which is specified in hundredths of a second, to 70. The optimize argument of this function requires that all images are the same size (luckily we did this!) and it causes R to only store the differences between frames.
educational_gif <-
  image_morph(img, frames = 4) %>%
  image_animate(delay = 70,
                optimize = TRUE)
Now to save our gif we can use the image_write() function of the magick package and the here() function of the here package to easily save it in a directory called img within the directory that contains our .Rproj file. We will name the file educational.gif.
image_write(educational_gif,
            here::here("img", "educational.gif"))
